Introduction
In this case study, we analyzed two data sets provided by our client Budweiser, the Beers dataset contains a list of 2410 US craft beers and Breweries dataset contains 558 US breweries. The purpose of the code is to find statistical characters of IBU and ABV data, also the correlation between them to address questions from Budweriser.
Repository Structure
Case study Repository in GitHub: https://github.com/48120778/EDAforProject1.git Files: Reers.csv: Source beer data. Breweries.csv: Source breweries data. EDAforProject1_R3.RMD: R Markdown file with all R code. EDAforProject1_R3.HTML: HTML file created with Knit.
Analysis
1. How many breweries are present in each state?
Number of breweries in 51 states has been calculated and provided as the below table, together with a plot with numbers in order.
##
## AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY
## 7 3 2 11 39 47 8 1 2 15 7 4 5 5 18 22 3 4
## LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH
## 5 23 7 9 32 12 9 2 9 19 1 5 3 3 4 2 16 15
## OK OR PA RI SC SD TN TX UT VA VT WA WI WV WY
## 6 29 25 5 4 1 3 28 4 16 10 23 20 1 4
Brew_Num <- BreweriesDF %>%
group_by(State) %>%
dplyr::summarize(count=n())
# Plot
Brew_Num$State = factor(Brew_Num$State, level = Brew_Num$State[order(-Brew_Num$count)])
Brew_Num %>%
ggplot(aes(x=State, y=count, fill=State))+geom_bar(stat = "identity", width = 0.8) +
ylab("Brewery")+ggtitle("Brewery vs State")+
geom_text(aes(label=count), hjust=0.4, vjust=-0.3, size=3) +
theme(axis.text.x = element_text(size=7, angle=90))2. Merge beer data with the breweries data. Print the first 6 observations and the last six observations to check the merged file.
We merged the two data set into one and named it as ‘mergeDF’, the primary key being used is ‘Brewery_id’ from Beer data set, and ‘Brew_ID’ from Breweries data set. We also changed the two columns’ name for clear understanding. The first and last 6 observations were showed there with head/tail command.
mergeDF <- merge(BeersDF, BreweriesDF, by.x = "Brewery_id", by.y = "Brew_ID")
colnames(mergeDF)[2] = "Beer_name"
colnames(mergeDF)[8] = "Brewery_name"
print("The first 6 observations of the merged file:")## [1] "The first 6 observations of the merged file:"
## Brewery_id Beer_name Beer_ID ABV IBU
## 1 1 Get Together 2692 0.045 50
## 2 1 Maggie's Leap 2691 0.049 26
## 3 1 Wall's End 2690 0.048 19
## 4 1 Pumpion 2689 0.060 38
## 5 1 Stronghold 2688 0.060 25
## 6 1 Parapet ESB 2687 0.056 47
## Style Ounces Brewery_name
## 1 American IPA 16 NorthGate Brewing
## 2 Milk / Sweet Stout 16 NorthGate Brewing
## 3 English Brown Ale 16 NorthGate Brewing
## 4 Pumpkin Ale 16 NorthGate Brewing
## 5 American Porter 16 NorthGate Brewing
## 6 Extra Special / Strong Bitter (ESB) 16 NorthGate Brewing
## City State
## 1 Minneapolis MN
## 2 Minneapolis MN
## 3 Minneapolis MN
## 4 Minneapolis MN
## 5 Minneapolis MN
## 6 Minneapolis MN
## [1] "The last 6 observations of the merged file:"
## Brewery_id Beer_name Beer_ID ABV IBU
## 2405 556 Pilsner Ukiah 98 0.055 NA
## 2406 557 Heinnieweisse Weissebier 52 0.049 NA
## 2407 557 Snapperhead IPA 51 0.068 NA
## 2408 557 Moo Thunder Stout 50 0.049 NA
## 2409 557 Porkslap Pale Ale 49 0.043 NA
## 2410 558 Urban Wilderness Pale Ale 30 0.049 NA
## Style Ounces Brewery_name
## 2405 German Pilsener 12 Ukiah Brewing Company
## 2406 Hefeweizen 12 Butternuts Beer and Ale
## 2407 American IPA 12 Butternuts Beer and Ale
## 2408 Milk / Sweet Stout 12 Butternuts Beer and Ale
## 2409 American Pale Ale (APA) 12 Butternuts Beer and Ale
## 2410 English Pale Ale 12 Sleeping Lady Brewing Company
## City State
## 2405 Ukiah CA
## 2406 Garrattsville NY
## 2407 Garrattsville NY
## 2408 Garrattsville NY
## 2409 Garrattsville NY
## 2410 Anchorage AK
3. Address the missing values in each column.
The missing values can only be found in column IBU (1005 NA’s) and column ABV (62 NA’s), refer to red bars in the first plot. As we don’t want to delete all the missing values, we started with Option 1 to deal with the missing values, in which the state level median values were calculated for both IBU and ABV and used to replace missing values respectively, but we found that the correlation between the IBU and ABV would be heavily impacted due to this replacement.
Option 1: Simply replacing missing values in IBU/ABV with state median results in significant drop on correlation, from 67% to 49% (Refer to #7)
IBUNADF <- mergeDF %>% filter(is.na(IBU))
ABVNADF <- mergeDF %>% filter(is.na(ABV))
BOTHNADF <- mergeDF %>% filter(is.na(ABV) & is.na(IBU))
# Check which state give the most NA's
summary(IBUNADF$State)## AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY
## 8 1 4 23 48 119 21 4 1 21 9 9 5 13 52 48 4 7
## LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH
## 9 31 11 20 124 9 13 0 17 29 0 16 6 0 8 3 28 17
## OK OR PA RI SC SD TN TX UT VA VT WA WI WV WY
## 8 38 53 7 9 7 1 41 15 5 10 25 45 0 3
## AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY
## 25 10 5 47 183 265 27 8 2 58 16 27 30 30 91 139 23 21
## LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH
## 19 82 21 27 162 55 42 11 40 59 3 25 8 8 14 11 74 49
## OK OR PA RI SC SD TN TX UT VA VT WA WI WV WY
## 19 125 100 27 14 7 6 130 26 40 27 68 87 2 15
## AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY
## 0 0 0 3 1 15 0 0 1 2 0 0 0 0 0 2 0 1
## LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH
## 0 0 0 0 11 0 3 0 1 4 0 4 0 0 1 1 1 0
## OK OR PA RI SC SD TN TX UT VA VT WA WI WV WY
## 0 0 3 0 0 0 0 6 0 0 0 0 2 0 0
# NADF <- data.frame(State=State, ALL=summary(mergeDF$State), IBUNA=summary(IBUNADF$State), ABVNA=summary(ABVNADF$State))
# NADF$IBUNAperc <- NADF$IBUNA/NADF$ALL
# NADF$ABVNAperc <- NADF$ABVNA/NADF$ALL
# Calculate IBU and ABV median value of each state
IBUmedianState <- mergeDF %>% filter(!is.na(IBU)) %>% group_by(State) %>% dplyr::summarise(IBUmedian=median(IBU))
ABVmedianState <- mergeDF %>% filter(!is.na(ABV)) %>% group_by(State) %>% dplyr::summarise(ABVmedian=median(ABV))
# Replace IBU NA's with state median value
mergeDF_no_IBU_NA <- left_join(mergeDF, IBUmedianState, by = "State")
mergeDF_no_IBU_NA$IBU[is.na(mergeDF_no_IBU_NA$IBU)] <- as.character(mergeDF_no_IBU_NA$IBUmedian[is.na(mergeDF_no_IBU_NA$IBU)])
aggr(mergeDF_no_IBU_NA, prop=FALSE, numbers=TRUE) # Still 7 observations IBU NA, because there is no any IBU value available for state # Replace 7 IBU NA's in SD with grand median value 35.00
mergeDF_no_IBU_NA$IBU = str_replace_na(mergeDF_no_IBU_NA$IBU, replacement = "35")
mergeDF_no_IBU_NA$IBUmedian = str_replace_na(mergeDF_no_IBU_NA$IBUmedian, replacement = "35")
# Replace ABV NA's with state median value
mergeDF_noNA <- left_join(mergeDF_no_IBU_NA, ABVmedianState, by = "State")
mergeDF_noNA$ABV[is.na(mergeDF_noNA$ABV)] <- as.character(mergeDF_noNA$ABVmedian[is.na(mergeDF_noNA$ABV)])
aggr(mergeDF_noNA, prop=FALSE, numbers=TRUE)mergeDF_noNA$IBU <- as.numeric(mergeDF_noNA$IBU)
mergeDF_noNA$ABV <- as.numeric(mergeDF_noNA$ABV)
# Plot
IBU.plotly <- mergeDF_noNA %>% ggplot(aes(x=IBU))+geom_bar()+ggtitle("IBU Distribution") +
theme(axis.text.x = element_text(size = 6, angle = 90))
ggplotly(IBU.plotly)ABV.plotly <- mergeDF_noNA %>% ggplot(aes(x=ABV))+geom_bar()+ggtitle("ABV Distribution") +
theme(axis.text.x = element_text(size = 6, angle = 90))
ggplotly(ABV.plotly)## Brewery_id Beer_name Beer_ID
## Min. : 1.0 Nonstop Hef Hop : 12 Min. : 1.0
## 1st Qu.: 94.0 Dale's Pale Ale : 6 1st Qu.: 808.2
## Median :206.0 Oktoberfest : 6 Median :1453.5
## Mean :232.7 Longboard Island Lager: 4 Mean :1431.1
## 3rd Qu.:367.0 1327 Pod's ESB : 3 3rd Qu.:2075.8
## Max. :558.0 Boston Lager : 3 Max. :2692.0
## (Other) :2376
## ABV IBU Style
## Min. :0.00100 Min. : 4.00 American IPA : 424
## 1st Qu.:0.05000 1st Qu.: 26.00 American Pale Ale (APA) : 245
## Median :0.05700 Median : 35.00 American Amber / Red Ale : 133
## Mean :0.05973 Mean : 39.81 American Blonde Ale : 108
## 3rd Qu.:0.06700 3rd Qu.: 47.00 American Double / Imperial IPA: 105
## Max. :0.12800 Max. :138.00 American Pale Wheat Ale : 97
## (Other) :1298
## Ounces Brewery_name City
## Min. : 8.40 Brewery Vivant : 62 Grand Rapids: 66
## 1st Qu.:12.00 Oskar Blues Brewery : 46 Portland : 64
## Median :12.00 Sun King Brewing Company : 38 Chicago : 55
## Mean :13.59 Cigar City Brewing Company: 25 Indianapolis: 43
## 3rd Qu.:16.00 Sixpoint Craft Ales : 24 San Diego : 42
## Max. :32.00 Hopworks Urban Brewery : 23 Boulder : 41
## (Other) :2192 (Other) :2099
## State IBUmedian ABVmedian
## CO : 265 Length:2410 Min. :0.04000
## CA : 183 Class :character 1st Qu.:0.05500
## MI : 162 Mode :character Median :0.05700
## IN : 139 Mean :0.05677
## TX : 130 3rd Qu.:0.05800
## OR : 125 Max. :0.06250
## (Other):1406
Option 2: State level median values were only used to replace ABV missing values (62 NA’s in ABV), then we predicted the IBU values (1005 NA’s) with a Simple Linear Regression model base on the exporatory variable ABV, this way we keep the correlation between IBU/ABV no big change. This is the option that we used to get the data frame ‘tidyDF’, which is used to address the following questions.
# Calculate ABV median value of each state
ABVmedianState <- mergeDF %>% filter(!is.na(ABV)) %>% group_by(State) %>%
dplyr::summarise(ABVmedian=median(ABV))
# Replace 62 NA's in ABV with state level median
mergeDF_no_ABV_NA <- left_join(mergeDF, ABVmedianState, by = "State")
mergeDF_no_ABV_NA$ABV[is.na(mergeDF_no_ABV_NA$ABV)] <-
mergeDF_no_ABV_NA$ABVmedian[is.na(mergeDF_no_ABV_NA$ABV)]
aggr(mergeDF_no_ABV_NA, prop=FALSE, numbers=TRUE)# Build simple regression model with 1405 pairs of IBU/ABV to get the correlation coef.
mergeDF_tidy <- mergeDF %>% filter(!is.na(IBU) & !is.na(ABV))
mod <- lm(mergeDF_tidy$IBU ~ mergeDF_tidy$ABV)
summary(mod)##
## Call:
## lm(formula = mergeDF_tidy$IBU ~ mergeDF_tidy$ABV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.849 -11.977 -0.721 13.997 93.458
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34.099 2.326 -14.66 <2e-16 ***
## mergeDF_tidy$ABV 1282.037 37.860 33.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.26 on 1403 degrees of freedom
## Multiple R-squared: 0.4497, Adjusted R-squared: 0.4493
## F-statistic: 1147 on 1 and 1403 DF, p-value: < 2.2e-16
Intercept <- mod$coefficients[1]
Slope <- mod$coefficients[2]
# Add new column predicted_IBU base on correlation coef.
# Change negative predicted_IBU to 0, IBU is measured on a scale from 0 to infinite
mergeDF_no_ABV_NA$predict_IBU <- pmax(Intercept + Slope*mergeDF_no_ABV_NA$ABV, 0)
summary(mergeDF_no_ABV_NA$IBU)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 4.00 21.00 35.00 42.71 64.00 138.00 1005
# Replace 1005 NA's in IBU with predicted_IBU
mergeDF_no_ABV_NA$IBU[is.na(mergeDF_no_ABV_NA$IBU)] <-
mergeDF_no_ABV_NA$predict_IBU[is.na(mergeDF_no_ABV_NA$IBU)]
summary(mergeDF_no_ABV_NA$IBU)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 26.00 36.41 42.49 55.64 138.00
4. Compute the median alcohol content(ABV) and international bitterness unit (IBU) for each state. Plot a bar chart to compare.
Median ABV and IBU data have been sorted and plotted.
# plot IBU median distribution using IBUmedianState data frame
## Calculate IBU and ABV median value of each state
IBUmedianState <- tidyDF %>% filter(!is.na(IBU)) %>% group_by(State) %>% dplyr::summarise(IBUmedian=median(IBU))
ABVmedianState <- tidyDF %>% filter(!is.na(ABV)) %>% group_by(State) %>% dplyr::summarise(ABVmedian=median(ABV))
## Plot
IBUmedianState$State <- factor(IBUmedianState$State, level = IBUmedianState$State[order(IBUmedianState$IBUmedian)])
IBUmedianState %>% ggplot(aes(x=State, y=IBUmedian, fill=State)) + geom_bar(stat = "identity") + ggtitle("IBU Median in States")+
geom_text(aes(label=IBUmedian), hjust=-0.5) +
theme(axis.text.y = element_text(size = 6)) +
coord_flip()# plot ABV median distribution using ABVmedianState data frame
ABVmedianState$State <- factor(ABVmedianState$State, level = ABVmedianState$State[order(ABVmedianState$ABVmedian)])
ABVmedianState %>% ggplot(aes(x=State, y=ABVmedian, fill=State)) + geom_bar(stat = "identity") + ggtitle("ABV Median in States")+
geom_text(aes(label=ABVmedian), hjust=-0.5) +
theme(axis.text.y = element_text(size = 6)) +
coord_flip()# plot scatter with 51 pairs of median values of ABV and IBU
ABV_IBU_mergeDF <- merge(ABVmedianState, IBUmedianState, by = "State" )
ABV_IBU_mergeDF$State <- factor(ABV_IBU_mergeDF$State, droplevels(ABV_IBU_mergeDF$State))
ABV_IBU_mergeDF %>% ggplot(aes(x=IBUmedian, y=ABVmedian)) + geom_point()5. Which state has the maximum alcoholic (ABV) beer? Which state has the most bitter (IBU) beer?
Maximum ABV and IBU data have been sorted and plotted.
# Calculate IBU and ABV max value of each state
IBUmaxState <- tidyDF %>% filter(!is.na(IBU)) %>% group_by(State) %>% dplyr::summarise(IBUmax=max(IBU))
ABVmaxState <- tidyDF %>% filter(!is.na(ABV)) %>% group_by(State) %>% dplyr::summarise(ABVmax=max(ABV))
# plot IBU max distribution using IBUmedianState data frame
IBUmaxState$State <- factor(IBUmaxState$State, level = IBUmaxState$State[order(IBUmaxState$IBUmax)])
IBUmaxState %>% ggplot(aes(x=State, y=IBUmax, fill=State)) + geom_bar(stat = "identity") + ggtitle("IBU max in States")+
geom_text(aes(label=IBUmax), hjust=-0.5) +
theme(axis.text.y = element_text(size = 6)) +
coord_flip()# plot ABV max distribution using ABVmaxState data frame
ABVmaxState$State <- factor(ABVmaxState$State, level = ABVmaxState$State[order(ABVmaxState$ABVmax)])
ABVmaxState %>% ggplot(aes(x=State, y=ABVmax, fill=State)) + geom_bar(stat = "identity") + ggtitle("ABV max in States")+
geom_text(aes(label=ABVmax), hjust=-0.5) +
theme(axis.text.y = element_text(size = 6)) +
coord_flip()Summary
Through our analysis:…
6. Comment on the summary statistics and distribution of the ABV variable.
Statistics of ABV variable can be found with summary function, boxplot, barchart and scatter plots have been provided.
7. Is there an apparent relationship between the bitterness of the beer and its alcoholic content? Draw a scatter plot. Make your best judgment of a relationship and EXPLAIN your answer.
Scatter plots with linear model smooth function are provided, the correlation between IBU and ABV is 0.757878, with a p-value < 2e-16, Multiple R-squared is 0.5744, meaning that 57% changes in exploratory variable ABV can explain the changes in response variable IBU.
8. Budweiser would also like to investigate the difference with respect to IBU and ABV between IPAs (India Pale Ales) and other types of Ale (any beer with “Ale” in its name other than IPA). You decide to use KNN clustering to investigate this relationship. Provide statistical evidence one way or the other. You can of course assume your audience is comfortable with percentages … KNN is very easy to understand.
A category variable IPA_Ale is created to split all the 2410 beers into three groups: Ale (1007), IPA (571) and Other (832), we only use the first two groups for analysis in this part. With this KNN classification model, we get an estimated accuracy of 0.7893 with k=5.
9. Knock their socks off! Find one other useful inference from the data that you feel Budweiser may be able to find value in. You must convince them why it is important and back up your conviction with appropriate statistical evidence.